CONCLUSIONS

For this project, I wanted to learn more about NHL player salary. I explore salary in the context of different determinants, including country, US city, draft year, and team. I also use a random forest regression model to predict salary and identify the most important features in prediction.

Note: see the data source (https://www.kaggle.com/datasets/camnugent/predict-nhl-player-salaries) for a complete list of column names and their abbreviations.

The median salary overall was $925,000. The maximum salary was $13,800,000, and the minimum was $575,000. Left-handed players are paid slightly more than right-handed players, with a 75th percentile salary nearly 13% higher. However, some of this difference might be attributed to more left-handed players being present in the dataset. Canada and the US had at least one player receiving the maximum salary. The US median matches the overall median, and the Canada median is $950,000. Seeing relatively high median salaries are the US cities of Madison, Minneapolis, and St. Paul, each matching or surpassing $4,000,000. Ann Arbor matches the overall median, and Detroit just falls short of it.

Among the most important determining features are date of birth (age), draft year, time on ice divided by games played, team goals while player on ice, and team shots on goal while player on ice. The random forest regression model predicts salary with a mean absolute error of around $1,000,000, which is about 8% of the overall range. The root mean square error is around $1,600,000, which is close to 12% of the overall range.

Improved feature engineering and feature selection might potentially increase the accuracy of this model further. Some columns could be split up into separate columns (e.g.: date of birth into year of birth, month, etc.), and some could be eliminated due to similarity (e.g.: CF (team’s shot attempts while player on ice), FF (team’s unblocked attempts while player on ice), SF (team’s shots on goal while player on ice)). An ensemble method might be useful as well. With three models created (one with all columns, one with negative importance columns removed, and one with a hyperparameter-tuned second model), accuracy improved primarily in the ten thousands range. The set of “important features,” as determined by IncNodePurity and %IncMSE charts, remained relatively consistent throughout model improvement.

salary_data <- read.csv("train.csv")
head(salary_data)
##    Salary     Born         City Pr.St Cntry Nat Ht  Wt DftYr DftRd Ovrl Hand
## 1  925000 97-01-30 Sainte-Marie    QC   CAN CAN 74 190  2015     1   18    L
## 2 2250000 93-12-21       Ottawa    ON   CAN CAN 74 207  2012     1   15    R
## 3 8000000 88-04-16     St. Paul    MN   USA USA 72 218  2006     1    7    R
## 4 3500000 92-01-07       Ottawa    ON   CAN CAN 77 220  2010     1    3    R
## 5 1750000 94-03-29      Toronto    ON   CAN CAN 76 217  2012     1   16    R
## 6 1500000 79-05-23    Strathroy    ON   CAN CAN 70 192  1997     6  156    L
##    Last.Name First.Name Position Team GP  G  A A1 A2 PTS X...  E... PIM Shifts
## 1     Chabot     Thomas        D  OTT  1  0  0  0  0   0   -2   0.0   0     13
## 2       Ceci       Cody        D  OTT 79  2 15  6  9  17  -11 -10.4  20   2418
## 3     Okposo       Kyle       RW  BUF 65 19 26 13 13  45   -7  -1.4  24   1443
## 4 Gudbranson       Erik        D  VAN 30  1  5  5  0   6  -14  -5.3  18    765
## 5     Wilson        Tom       RW  WSH 82  7 12  4  8  19    9   4.1 133   1453
## 6   Campbell      Brian        D  CHI 80  5 12  6  6  17   12   0.7  24   1896
##      TOI   TOIX TOI.GP TOI.GP.1 TOI. IPP. SH.   SV.  PDO F.60  A.60 Pct. Diff
## 1    429    7.2   7.15     7.16 15.2  0.0 0.0 0.750  750 0.00 16.74  0.0   -2
## 2 109992 1826.2  23.20    23.17 39.0 30.4 7.4 0.915  989 1.84  2.79 39.7  -29
## 3  73983 1229.2  18.97    18.95 33.1 63.4 9.7 0.934 1031 3.47  1.95 64.0   31
## 4  36603  607.9  20.33    20.31 36.1 37.5 6.2 0.897  959 1.58  3.45 31.4  -19
## 5  63592 1059.7  12.93    12.93 23.5 61.3 7.8 0.917  995 1.76  2.32 43.1  -10
## 6  88462 1473.7  18.43    18.43 32.4 23.9 9.8 0.936 1033 2.89  1.91 60.2   24
##   Diff.60 iCF iCF.1 iFF iSF iSF.1 iSF.2  ixG iSCF iRB iRS iDS sDist sDist.1
## 1  -16.74   2     2   2   1     1     1  0.0    0   0   0   0  43.0    49.3
## 2   -0.95 287   287 197 143   143   143  6.1    7   7   9  16  52.4    46.3
## 3    1.51 283   283 212 155   156   156 17.4   64  16  20  36  28.4    26.3
## 4   -1.88  88    88  55  40    40    40  1.4    2   1   4   5  55.1    51.0
## 5   -0.57 166   166 118  95    95    95  9.3   35   8  10  18  30.9    26.4
## 6    0.98 171   171 110  75    74    75  4.5    7   2   3   5  46.1    41.9
##    Pass iHF iHF.1 iHA iHDf iMiss iGVA iTKA iBLK iGVA.1 iTKA.1 iBLK.1 BLK. iFOW
## 1   0.0   1     1   0    1     1    1    0    0      1      0      0  0.0    0
## 2 138.1 111   111 154  -43    54   74   22  159     74     22    159  8.0    1
## 3 196.8  53    53  68  -15    57   36   26   25     36     26     25  2.4   54
## 4 153.0  66    66  66    0    15   23    4   44     23      4     44  7.3    0
## 5  96.3 239   239 134  105    23   21   36   44     21     36     44  4.4    3
## 6  95.7  43    43 157 -114    35   59   11   83     59     11     83  6.5    0
##   iFOL iFOW.1 iFOL.1   FO. X.FOT dzFOW dzFOL nzFOW nzFOL ozFOW ozFOL FOW.Up
## 1    0      0      0   0.0   0.0     0     0     0     0     0     0      0
## 2    0      1      0 100.0   0.1     1     0     0     0     0     0      0
## 3   45     54     45  54.5   7.4     9     6    10    11    35    28     13
## 4    0      0      0   0.0   0.0     0     0     0     0     0     0      0
## 5    7      3      7  30.0   1.0     1     1     0     2     2     4      3
## 6    0      0      0   0.0   0.0     0     0     0     0     0     0      0
##   FOL.Up FOW.Down FOL.Down FOW.Close FOL.Close OTG X1G GWG ENG PSG PSA G.Bkhd
## 1      0        0        0         0         0   0   0   0   0   0   0      0
## 2      0        0        0         1         0   0   0   0   0   0   0      0
## 3     10       21       16        37        33   1   5   2   0   0   0      5
## 4      0        0        0         0         0   0   0   1   0   0   0      0
## 5      1        0        3         2         6   0   2   0   1   0   0      0
## 6      0        0        0         0         0   0   1   1   0   0   0      0
##   G.Dflct G.Slap G.Snap G.Tip G.Wrap G.Wrst CBar Post Over Wide S.Bkhd S.Dflct
## 1       0      0      0     0      0      0    0    0    0    1      0       0
## 2       0      1      0     0      0      1    0    1    2   51      2       0
## 3       2      0      3     0      0      9    0    2    4   51     19       3
## 4       0      0      0     0      0      1    0    0    0   15      0       0
## 5       0      0      0     1      0      6    0    2    1   20      7       2
## 6       0      2      0     0      0      2    0    1    6   28      2       0
##   S.Slap S.Snap S.Tip S.Wrap S.Wrst iPenT iPenD iPENT iPEND iPenDf   NPD Min
## 1      1      0     0      0      0     0     0     0     0      0   0.0   0
## 2     49     12     0      1     79    10     6    10     5     -4   2.2  10
## 3      3     20     8      2    101    12    10    11     8     -2  -0.5  12
## 4     18      3     0      0     19     6     7     6     6      1   2.7   4
## 5      3     10    11      1     61    44    33    40    29    -11 -14.3  33
## 6     32      9     1      0     30    12    11    12     8     -1   5.4  12
##   Maj Match Misc Game   CF   CA   FF   FA  SF  SA  xGF  xGA SCF SCA GF GA RBF
## 1   0     0    0    0    9   12    8   10   5   8  0.5  0.9   2   3  0  2   1
## 2   0     0    0    0 1433 1992 1038 1423 757 997 62.0 88.8 197 280 56 85  68
## 3   0     0    0    0 1301 1051  986  826 734 606 70.8 46.4 235 133 71 40  60
## 4   2     0    0    0  460  605  339  467 259 340 22.0 33.6  80 130 16 35  27
## 5   9     0    1    1  766  992  546  720 398 495 33.5 47.5 124 159 31 41  30
## 6   0     0    0    0 1356 1281  971  972 728 730 62.9 59.9 210 197 71 47  30
##   RBA RSF RSA DSF DSA FOW FOL  HF  HA GVA TKA PENT PEND  OPS  DPS   PS    OTOI
## 1   1   0   1   1   2   4   5   1   2   1   1    1    1  0.0 -0.2 -0.2   40.03
## 2  82  79  94 147 176 949 939 749 671 284 197  104   98 -0.2  3.4  3.2 2850.59
## 3  34  76  52 136  86 739 600 340 351 168 129   56   70  3.7  1.3  5.0 2486.75
## 4  20  29  32  56  52 324 328 198 197  86  59   26   22  0.0  0.4  0.5 1074.41
## 5  37  43  53  73  90 528 490 512 422 157 126   88   68 -0.1  1.4  1.3 3459.09
## 6  56  58  85  88 141 570 667 348 707 223 168   76   60  0.6  3.7  4.3 3069.81
##   Grit  DAP  Pace   GS  GS.G
## 1    1  0.0 175.7 -0.4 -0.38
## 2  290 13.3 112.5 14.1  0.18
## 3  102  6.6 114.8 36.8  0.57
## 4  130 17.5 105.1  5.9  0.20
## 5  425  8.3  99.5 21.8  0.27
## 6  150  4.5 107.4 20.8  0.26

Data Exploration

There are over 150 columns in this dataset, meaning the correlation heatmap for the full dataset will be hard to read. Regardless, I’ll make one as a big-picture starting point. I’ll also make a pairplot.

suppressWarnings({
  library(tidyr)
  library(corrplot)
  library(dplyr)
  })
## corrplot 0.92 loaded
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
print(ncol(salary_data))
## [1] 154
print(unique(colnames(salary_data)))
##   [1] "Salary"     "Born"       "City"       "Pr.St"      "Cntry"     
##   [6] "Nat"        "Ht"         "Wt"         "DftYr"      "DftRd"     
##  [11] "Ovrl"       "Hand"       "Last.Name"  "First.Name" "Position"  
##  [16] "Team"       "GP"         "G"          "A"          "A1"        
##  [21] "A2"         "PTS"        "X..."       "E..."       "PIM"       
##  [26] "Shifts"     "TOI"        "TOIX"       "TOI.GP"     "TOI.GP.1"  
##  [31] "TOI."       "IPP."       "SH."        "SV."        "PDO"       
##  [36] "F.60"       "A.60"       "Pct."       "Diff"       "Diff.60"   
##  [41] "iCF"        "iCF.1"      "iFF"        "iSF"        "iSF.1"     
##  [46] "iSF.2"      "ixG"        "iSCF"       "iRB"        "iRS"       
##  [51] "iDS"        "sDist"      "sDist.1"    "Pass"       "iHF"       
##  [56] "iHF.1"      "iHA"        "iHDf"       "iMiss"      "iGVA"      
##  [61] "iTKA"       "iBLK"       "iGVA.1"     "iTKA.1"     "iBLK.1"    
##  [66] "BLK."       "iFOW"       "iFOL"       "iFOW.1"     "iFOL.1"    
##  [71] "FO."        "X.FOT"      "dzFOW"      "dzFOL"      "nzFOW"     
##  [76] "nzFOL"      "ozFOW"      "ozFOL"      "FOW.Up"     "FOL.Up"    
##  [81] "FOW.Down"   "FOL.Down"   "FOW.Close"  "FOL.Close"  "OTG"       
##  [86] "X1G"        "GWG"        "ENG"        "PSG"        "PSA"       
##  [91] "G.Bkhd"     "G.Dflct"    "G.Slap"     "G.Snap"     "G.Tip"     
##  [96] "G.Wrap"     "G.Wrst"     "CBar"       "Post"       "Over"      
## [101] "Wide"       "S.Bkhd"     "S.Dflct"    "S.Slap"     "S.Snap"    
## [106] "S.Tip"      "S.Wrap"     "S.Wrst"     "iPenT"      "iPenD"     
## [111] "iPENT"      "iPEND"      "iPenDf"     "NPD"        "Min"       
## [116] "Maj"        "Match"      "Misc"       "Game"       "CF"        
## [121] "CA"         "FF"         "FA"         "SF"         "SA"        
## [126] "xGF"        "xGA"        "SCF"        "SCA"        "GF"        
## [131] "GA"         "RBF"        "RBA"        "RSF"        "RSA"       
## [136] "DSF"        "DSA"        "FOW"        "FOL"        "HF"        
## [141] "HA"         "GVA"        "TKA"        "PENT"       "PEND"      
## [146] "OPS"        "DPS"        "PS"         "OTOI"       "Grit"      
## [151] "DAP"        "Pace"       "GS"         "GS.G"
salary_data_numeric <- salary_data %>% mutate_if(~ !is.numeric(.), ~ as.numeric(factor(.)))
salary_data_numeric <- salary_data_numeric %>% drop_na()

print(paste("Salary Data Rows: ", nrow(salary_data)))
## [1] "Salary Data Rows:  612"
print(paste("Salary Data Rows, n/a rows excluded: ", nrow(salary_data_numeric)))
## [1] "Salary Data Rows, n/a rows excluded:  494"

Below is focused exploration on salary.

ggplot(mapping=aes(x=Salary), data=salary_data_numeric) +
  geom_histogram(aes(fill=factor(Salary)), show.legend = FALSE, bins=35) +
  labs(title = "Salary Histogram")

ggplot(mapping=aes(x=Salary), data=salary_data_numeric) +
  geom_density(color="darkred", show.legend=FALSE) +
  labs(title = "Salary Density Plot")

ggplot(mapping=aes(y=Salary), data=salary_data_numeric) +
  geom_boxplot(color="darkred", show.legend=FALSE) +
  labs(title = "Salary Box Plot")

print(summary(salary_data_numeric$Salary))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   575000   800000  1025000  2487720  4000000 13800000
res <- cor(salary_data_numeric, method="kendall")
corrplot(res, tl.cex=0.3, na.label.col="snow",
         method="color")

As expected, only some columns have Kendall correlations with Salary. Let’s select the columns with a correlation greater than 0.35.

greater_35_corr_vars <- colnames(res[, abs(res["Salary",]) > 0.35])
greater_35_corr_vars <- greater_35_corr_vars[!is.na(greater_35_corr_vars)]
print(greater_35_corr_vars)
##  [1] "Salary"   "GP"       "G"        "A"        "A1"       "A2"      
##  [7] "PTS"      "Shifts"   "TOI"      "TOIX"     "TOI.GP"   "TOI.GP.1"
## [13] "TOI."     "iCF"      "iCF.1"    "iFF"      "iSF"      "iSF.1"   
## [19] "iSF.2"    "ixG"      "iRB"      "iRS"      "iDS"      "Pass"    
## [25] "iMiss"    "iGVA"     "iTKA"     "iGVA.1"   "iTKA.1"   "GWG"     
## [31] "G.Wrst"   "Over"     "Wide"     "S.Slap"   "S.Snap"   "S.Wrst"  
## [37] "iPenT"    "iPENT"    "Min"      "CF"       "CA"       "FF"      
## [43] "FA"       "SF"       "SA"       "xGF"      "xGA"      "SCF"     
## [49] "SCA"      "GF"       "GA"       "RBF"      "RBA"      "RSF"     
## [55] "RSA"      "DSF"      "DSA"      "FOW"      "FOL"      "HF"      
## [61] "HA"       "GVA"      "TKA"      "PENT"     "PEND"     "OPS"     
## [67] "DPS"      "PS"       "OTOI"     "GS"       "GS.G"
salary_dnc <- salary_data_numeric %>% select(all_of(greater_35_corr_vars))
summary(salary_dnc)
##      Salary               GP              G                A        
##  Min.   :  575000   Min.   : 1.00   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.:  800000   1st Qu.:28.00   1st Qu.: 1.000   1st Qu.: 3.00  
##  Median : 1025000   Median :66.00   Median : 6.000   Median :11.50  
##  Mean   : 2487720   Mean   :54.25   Mean   : 8.466   Mean   :14.27  
##  3rd Qu.: 4000000   3rd Qu.:79.00   3rd Qu.:13.000   3rd Qu.:22.00  
##  Max.   :13800000   Max.   :82.00   Max.   :44.000   Max.   :63.00  
##        A1               A2              PTS            Shifts    
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.00   Min.   :  13  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 5.00   1st Qu.: 510  
##  Median : 6.000   Median : 5.000   Median :17.00   Median :1369  
##  Mean   : 7.941   Mean   : 6.334   Mean   :22.74   Mean   :1196  
##  3rd Qu.:12.000   3rd Qu.: 9.000   3rd Qu.:37.00   3rd Qu.:1805  
##  Max.   :36.000   Max.   :31.000   Max.   :89.00   Max.   :2657  
##       TOI              TOIX            TOI.GP         TOI.GP.1    
##  Min.   :   429   Min.   :   7.2   Min.   : 6.25   Min.   : 6.25  
##  1st Qu.: 21415   1st Qu.: 353.5   1st Qu.:12.33   1st Qu.:12.31  
##  Median : 59199   Median : 986.7   Median :15.62   Median :15.61  
##  Mean   : 54668   Mean   : 906.9   Mean   :15.55   Mean   :15.54  
##  3rd Qu.: 83812   3rd Qu.:1387.2   3rd Qu.:18.38   3rd Qu.:18.37  
##  Max.   :133550   Max.   :2218.9   Max.   :27.15   Max.   :27.12  
##       TOI.            iCF            iCF.1            iFF       
##  Min.   :13.00   Min.   :  1.0   Min.   :  1.0   Min.   :  1.0  
##  1st Qu.:23.12   1st Qu.: 61.0   1st Qu.: 61.0   1st Qu.: 41.5  
##  Median :27.70   Median :163.0   Median :164.0   Median :124.0  
##  Mean   :27.78   Mean   :172.3   Mean   :172.3   Mean   :128.9  
##  3rd Qu.:32.30   3rd Qu.:260.8   3rd Qu.:260.8   3rd Qu.:191.8  
##  Max.   :44.90   Max.   :624.0   Max.   :582.0   Max.   :455.0  
##       iSF             iSF.1            iSF.2             ixG        
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.: 29.25   1st Qu.: 29.25   1st Qu.: 29.25   1st Qu.: 2.100  
##  Median : 86.00   Median : 86.00   Median : 86.00   Median : 6.600  
##  Mean   : 93.12   Mean   : 93.33   Mean   : 93.36   Mean   : 8.371  
##  3rd Qu.:140.00   3rd Qu.:139.75   3rd Qu.:140.00   3rd Qu.:13.100  
##  Max.   :313.00   Max.   :313.00   Max.   :313.00   Max.   :33.000  
##       iRB              iRS              iDS             Pass      
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.00   Min.   :  0.0  
##  1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 4.00   1st Qu.: 44.9  
##  Median : 5.000   Median : 7.000   Median :12.50   Median :127.5  
##  Mean   : 6.609   Mean   : 8.024   Mean   :14.63   Mean   :151.2  
##  3rd Qu.:10.000   3rd Qu.:12.000   3rd Qu.:22.00   3rd Qu.:231.2  
##  Max.   :41.000   Max.   :32.000   Max.   :63.00   Max.   :598.3  
##      iMiss             iGVA             iTKA           iGVA.1      
##  Min.   :  0.00   Min.   :  0.00   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.: 12.00   1st Qu.:  8.00   1st Qu.: 5.00   1st Qu.:  8.00  
##  Median : 34.00   Median : 22.00   Median :17.00   Median : 22.00  
##  Mean   : 35.93   Mean   : 25.71   Mean   :20.49   Mean   : 25.65  
##  3rd Qu.: 54.75   3rd Qu.: 39.00   3rd Qu.:32.00   3rd Qu.: 39.00  
##  Max.   :142.00   Max.   :106.00   Max.   :96.00   Max.   :106.00  
##      iTKA.1           GWG            G.Wrst            Over       
##  Min.   : 0.00   Min.   :0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 5.00   1st Qu.:0.000   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :17.00   Median :1.000   Median : 2.000   Median : 3.000  
##  Mean   :20.44   Mean   :1.447   Mean   : 4.198   Mean   : 3.607  
##  3rd Qu.:32.00   3rd Qu.:2.000   3rd Qu.: 7.000   3rd Qu.: 5.000  
##  Max.   :96.00   Max.   :9.000   Max.   :30.000   Max.   :19.000  
##       Wide            S.Slap           S.Snap          S.Wrst     
##  Min.   :  0.00   Min.   :  0.00   Min.   : 0.00   Min.   :  0.0  
##  1st Qu.: 10.00   1st Qu.:  2.00   1st Qu.: 3.00   1st Qu.: 14.0  
##  Median : 28.50   Median :  9.00   Median :10.00   Median : 44.0  
##  Mean   : 30.44   Mean   : 16.39   Mean   :14.59   Mean   : 48.1  
##  3rd Qu.: 46.00   3rd Qu.: 22.00   3rd Qu.:21.00   3rd Qu.: 71.0  
##  Max.   :112.00   Max.   :141.00   Max.   :78.00   Max.   :214.0  
##      iPenT          iPENT            Min              CF        
##  Min.   : 0.0   Min.   : 0.00   Min.   : 0.00   Min.   :   8.0  
##  1st Qu.: 4.0   1st Qu.: 4.00   1st Qu.: 4.00   1st Qu.: 307.0  
##  Median :10.0   Median :10.00   Median : 9.00   Median : 877.0  
##  Mean   :11.2   Mean   :10.73   Mean   :10.09   Mean   : 850.8  
##  3rd Qu.:16.0   3rd Qu.:16.00   3rd Qu.:15.00   3rd Qu.:1324.2  
##  Max.   :48.0   Max.   :44.00   Max.   :39.00   Max.   :2308.0  
##        CA               FF               FA               SF        
##  Min.   :   6.0   Min.   :   5.0   Min.   :   5.0   Min.   :   4.0  
##  1st Qu.: 336.5   1st Qu.: 224.5   1st Qu.: 244.0   1st Qu.: 163.0  
##  Median : 895.5   Median : 655.0   Median : 670.0   Median : 471.5  
##  Mean   : 823.7   Mean   : 634.8   Mean   : 614.8   Mean   : 456.9  
##  3rd Qu.:1215.5   3rd Qu.: 987.8   3rd Qu.: 914.0   3rd Qu.: 714.8  
##  Max.   :2273.0   Max.   :1668.0   Max.   :1765.0   Max.   :1181.0  
##        SA              xGF              xGA              SCF        
##  Min.   :   2.0   Min.   :  0.20   Min.   :  0.40   Min.   :  0.00  
##  1st Qu.: 177.5   1st Qu.: 13.70   1st Qu.: 14.82   1st Qu.: 45.25  
##  Median : 483.0   Median : 39.75   Median : 41.75   Median :129.50  
##  Mean   : 442.2   Mean   : 41.02   Mean   : 39.16   Mean   :136.57  
##  3rd Qu.: 656.0   3rd Qu.: 64.42   3rd Qu.: 57.77   3rd Qu.:210.00  
##  Max.   :1276.0   Max.   :111.10   Max.   :108.00   Max.   :419.00  
##       SCA              GF               GA              RBF        
##  Min.   :  0.0   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 48.0   1st Qu.: 11.25   1st Qu.: 13.00   1st Qu.: 10.00  
##  Median :137.0   Median : 37.00   Median : 40.00   Median : 29.00  
##  Mean   :130.4   Mean   : 40.54   Mean   : 37.69   Mean   : 32.55  
##  3rd Qu.:195.8   3rd Qu.: 67.00   3rd Qu.: 57.00   3rd Qu.: 50.00  
##  Max.   :344.0   Max.   :120.00   Max.   :100.00   Max.   :110.00  
##       RBA             RSF              RSA              DSF        
##  Min.   : 0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:11.25   1st Qu.: 14.25   1st Qu.: 15.00   1st Qu.: 25.00  
##  Median :30.00   Median : 39.00   Median : 40.00   Median : 70.50  
##  Mean   :30.30   Mean   : 39.71   Mean   : 39.37   Mean   : 72.27  
##  3rd Qu.:45.00   3rd Qu.: 59.75   3rd Qu.: 58.00   3rd Qu.:109.75  
##  Max.   :95.00   Max.   :130.00   Max.   :112.00   Max.   :213.00  
##       DSA              FOW              FOL               HF       
##  Min.   :  0.00   Min.   :   4.0   Min.   :   4.0   Min.   :  0.0  
##  1st Qu.: 27.00   1st Qu.: 149.2   1st Qu.: 163.0   1st Qu.:157.2  
##  Median : 75.50   Median : 461.5   Median : 478.0   Median :343.0  
##  Mean   : 69.67   Mean   : 447.7   Mean   : 444.5   Mean   :327.4  
##  3rd Qu.:103.00   3rd Qu.: 696.0   3rd Qu.: 674.5   3rd Qu.:464.0  
##  Max.   :185.00   Max.   :1257.0   Max.   :1196.0   Max.   :926.0  
##        HA             GVA              TKA             PENT       
##  Min.   :  2.0   Min.   :  0.00   Min.   :  0.0   Min.   :  0.00  
##  1st Qu.:169.8   1st Qu.: 49.25   1st Qu.: 36.0   1st Qu.: 21.00  
##  Median :343.0   Median :132.50   Median :102.5   Median : 55.00  
##  Mean   :324.2   Mean   :132.64   Mean   :100.0   Mean   : 51.47  
##  3rd Qu.:468.8   3rd Qu.:203.75   3rd Qu.:149.0   3rd Qu.: 76.00  
##  Max.   :870.0   Max.   :388.00   Max.   :347.0   Max.   :123.00  
##       PEND             OPS              DPS               PS        
##  Min.   :  0.00   Min.   :-1.500   Min.   :-0.200   Min.   :-1.200  
##  1st Qu.: 22.00   1st Qu.: 0.000   1st Qu.: 0.400   1st Qu.: 0.400  
##  Median : 53.00   Median : 0.600   Median : 1.100   Median : 2.200  
##  Mean   : 51.03   Mean   : 1.478   Mean   : 1.433   Mean   : 2.913  
##  3rd Qu.: 76.00   3rd Qu.: 2.575   3rd Qu.: 2.000   3rd Qu.: 4.875  
##  Max.   :137.00   Max.   :10.500   Max.   : 7.200   Max.   :12.900  
##       OTOI               GS              GS.G        
##  Min.   :  33.51   Min.   :-3.500   Min.   :-0.5900  
##  1st Qu.:1072.76   1st Qu.: 4.925   1st Qu.: 0.1725  
##  Median :2632.04   Median :18.550   Median : 0.3300  
##  Mean   :2149.35   Mean   :24.315   Mean   : 0.3673  
##  3rd Qu.:3067.10   3rd Qu.:39.350   3rd Qu.: 0.5675  
##  Max.   :3521.78   Max.   :99.200   Max.   : 1.2600

This is a little more legible, but still hard to decipher which variables have the strongest correlation with salary. We can take a quick look at the correlation colors present in the top row of the map. As far as salary goes, there are a lot of so-so, sky blue correlations; none seem to be a darker shade of blue. There also appears to be stronger correlations between some indicator variables, as shown by the presence of darker colored squares in the heatmap.

res_35greater <- cor(salary_dnc, method="kendall")
corrplot(res_35greater, na.label="square", na.label.col="snow", tl.cex = 0.5,
         method="color")

Finally, let’s take an up-close look at variables with the strongest correlations. There are six variables with correlation scores greater than 0.5.

strongest_cols <- colnames(res[, abs(res["Salary",]) > 0.5])
strongest_cols <- strongest_cols[!is.na(strongest_cols)]
print(strongest_cols)
## [1] "Salary" "CF"     "FF"     "SF"     "xGF"    "SCF"    "GF"
salary_most_cut <- salary_data_numeric %>% select(all_of(strongest_cols))
summary(salary_most_cut)
##      Salary               CF               FF               SF        
##  Min.   :  575000   Min.   :   8.0   Min.   :   5.0   Min.   :   4.0  
##  1st Qu.:  800000   1st Qu.: 307.0   1st Qu.: 224.5   1st Qu.: 163.0  
##  Median : 1025000   Median : 877.0   Median : 655.0   Median : 471.5  
##  Mean   : 2487720   Mean   : 850.8   Mean   : 634.8   Mean   : 456.9  
##  3rd Qu.: 4000000   3rd Qu.:1324.2   3rd Qu.: 987.8   3rd Qu.: 714.8  
##  Max.   :13800000   Max.   :2308.0   Max.   :1668.0   Max.   :1181.0  
##       xGF              SCF               GF        
##  Min.   :  0.20   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 13.70   1st Qu.: 45.25   1st Qu.: 11.25  
##  Median : 39.75   Median :129.50   Median : 37.00  
##  Mean   : 41.02   Mean   :136.57   Mean   : 40.54  
##  3rd Qu.: 64.42   3rd Qu.:210.00   3rd Qu.: 67.00  
##  Max.   :111.10   Max.   :419.00   Max.   :120.00

The six variables are CF (team’s shot attempts while player on ice), FF (team’s unblocked attempts while player on ice), SF (team’s shots on goal while player on ice), xGF (team’s expected goals while player on ice, by attempts by location), SCF (team’s scoring chances while player on ice), and GF (team’s goals while player on ice).

Since a lot of these variables are similar or related, it makes sense that there is a high correlation score between them.

res_cut2 <- cor(salary_most_cut, method="kendall")
corrplot(res_cut2, na.label="square", tl.cex = 1, addCoef.col="white", method="color")

Here are scatterplots for these six variables. A linear regression line is included. These variables may or may not have a linear correlation, but rather an exponential correlation, polynomial, simple monotonic relationship, etc. The linear line helps compare overall shapes and positive trends in an exploratory chapter. Included is a pairplot with histograms on the diagonals.

for (variable in colnames(salary_most_cut)) {
  print(ggplot(mapping=aes(y=salary_most_cut$Salary, x=salary_most_cut[[variable]])) +
  geom_point(color="red") +
  geom_smooth(method="lm", formula = y ~ x, color="darkred") +
  labs(title = paste("Salary vs.", variable),
       x = variable,
       y = "Salary"))
}

#from R help page
#help(pairs)
panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = "red", ...)
}

pairs(salary_most_cut, diag.panel = panel.hist)

Here are some graphs related to salary. One explores the relationship between salary and team goals while the player was on ice. Another shows a scatterplot between salary and goals scored by the player themself. Both of these scatterplots separate points into two color groups – one for left-handed players and another for right-handed. Finally, there is a boxplot of salary for left-handed and for right-handed players.

Interestingly, there are more left-handed players in the dataset than right-handed. There doesn’t seem to be a significant trend on the scatterplots, however, it is interesting that three out of the top five salaries are given to left-handed players. The boxplots show that the median salary for both left- and right-handed players is the same (925,000), but the IQR (3,087,750), upper fence (7,500,000), and maximum (including outliers) (13,800,000) are higher for left-handed players are larger than those for right-handed players (2,582,500; 7,000,000; 12,000,000 respectively). The 75th percentile salary is nearly 13% higher for left-handed players than right-handed.

suppressWarnings({library(plotly)})
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
sd_left <- salary_data[salary_data$Hand == "L",]
sd_right <- salary_data[salary_data$Hand == "R",]

ggplot(mapping=aes(x=Hand), data=salary_data) +
  geom_bar(fill = c("darkred", "red")) +
  labs(title = "Dominant Hand Bar Chart")

ggplot(mapping=aes(y=Salary, x=GF, color=factor(Hand)), data=salary_data) +
  geom_point(na.rm=TRUE) +
  scale_color_manual(values=c("darkred", "red")) +
  labs(title = "Salary vs. Team Goals while Player on Ice (GF) - Right/Left Handed") +
  geom_smooth(method="lm", formula = y ~ x, data=sd_left, color = "darkred", na.rm=TRUE) +
  geom_smooth(method="lm", formula = y ~ x, data=sd_right, color="red", na.rm=TRUE)

ggplot(mapping=aes(y=Salary, x=G, color=factor(Hand)), data=salary_data) +
  geom_point(na.rm=TRUE) +
  scale_color_manual(values=c("darkred", "red")) +
  labs(title = "Salary vs. Team Goals while Player on Ice (GF) - Right/Left Handed") +
  geom_smooth(method="lm", formula = y ~ x, data=sd_left, color = "darkred") +
  geom_smooth(method="lm", formula = y ~ x, data=sd_right, color="red")

hand_plot <- ggplot(mapping=aes(y=Salary, x=Hand), data=salary_data) +
  geom_boxplot(fill = c("darkred", "red"), show.legend=FALSE) + 
  #scale_color_manual(values = c("darkred", "red")) +
  labs(title = "Boxplots of Salary by Dominant Hand") + 
  theme(axis.text.x = element_text(angle=90))
ggplotly(hand_plot)

Below are the same graphs, but this time, color groups are separated by country.

ggplot(mapping=aes(y=Salary, x=GF, color=factor(Cntry)), data=salary_data) +
  geom_point(na.rm=TRUE) +
  labs(title = "Salary vs. Team Goals while Player on Ice (GF) - Country")

ggplot(mapping=aes(y=Salary, x=G, color=factor(Cntry)), data=salary_data) +
  geom_point(na.rm=TRUE) +
  labs(title = "Salary vs. Team Goals while Player on Ice (GF) - Country")

Here is a boxplot for salary by country, and another which breaks the prior graph into dominant hand groups. With 291 players, Canada has the most players in the dataset by far. Canada is followed by the US with 168. Canada and the US both pay the maximum salary in the dataset (13,800,000) to at least one of their players, and Canada pays the highest right-handed salary (12,000,000) as well. Many countries have higher median values for left-handed players than right-handed, although this could be due to more left-handed players appearing in the dataset.

cntry_bar <- ggplot(mapping=aes(x=Cntry), data=salary_data) +
  geom_bar(mapping=aes(fill=factor(Cntry)), show.legend=FALSE) +
  labs(title = "Distribution of Countries") +
  theme(axis.text.x = element_text(angle=90))
ggplotly(cntry_bar)
cntry_sal <- ggplot(mapping=aes(y=Salary, x=Cntry), data=salary_data) +
  geom_boxplot(mapping=aes(fill=factor(Cntry))) + 
  labs(title = "Boxplots of Salary by Country")
ggplotly(cntry_sal)
cntry_sal_hand <- ggplot(mapping=aes(x=Cntry, y=Salary), data=salary_data) +
  geom_boxplot(mapping=aes(fill=factor(Cntry))) + 
  labs(title = "Boxplots of Salary by Country - Dominant Hand") + 
  theme(axis.text = element_text(angle=90)) +
  facet_wrap(~Hand) 
ggplotly(cntry_sal_hand)

Here’s a summary of salary in the USA.

sd_usa <- salary_data[salary_data$Cntry == "USA",]
summary(sd_usa$Salary)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   575000   667500   925000  2047664  2500000 13800000

Below is a distribution graph of US cities in the dataset and boxplots of salary by US city. Because there are many cities in the dataset, I made these plots interactive, so that zooming in and out and moving around the graph would be possible.

Ann Arbor, Madison, and Rochester have the most players in the dataset, followed by Buffalo, Minneapolis, and Pittsburgh. The majority of US cities in the dataset include data for one player.

The US player receiving the maximum salary (13,800,000) is from Buffalo, a city with a median salary of 1,587,500. Ann Arbor’s median salary matches the country’s median of 925,000, while Detroit fell short at 842,500. Madison had a median of 4,000,000, St. Paul 5,000,000, and Minneapolis 6,500,000.

plot1 <- ggplot(mapping=aes(x=City), data=sd_usa) +
  geom_bar(mapping=aes(fill=factor(City)), show.legend=FALSE) +
  labs(title = "Distribution of US Cities") +
  theme(axis.text.x = element_text(angle=90),
        axis.text = element_text(size=5))
ggplotly(plot1)
plot2 <- ggplot(mapping=aes(y=Salary, x=City), data=sd_usa) +
  geom_boxplot(mapping=aes(fill=factor(City)), show.legend=FALSE) +
  labs(title = "Boxplots of Salary by US City") + 
  theme(axis.text.x = element_text(angle=90),
        axis.text = element_text(size=5))
ggplotly(plot2)

Here are some plots exploring salary by team and draft year.

ggplot(mapping=aes(x=Team), data=sd_usa) +
  geom_bar(mapping=aes(fill=factor(Team)), show.legend=FALSE) +
  labs(title = "Distribution of US Teams") +
  theme(axis.text.x = element_text(angle=90))

ggplot(mapping=aes(y=Salary, x=Team), data=salary_data) +
  geom_boxplot(mapping=aes(fill=factor(Team)), show.legend=FALSE) + 
  labs(title = "Boxplots of Salary by Team") + 
  theme(axis.text.x = element_text(angle=90),
        axis.text = element_text(size=5))

ggplot(mapping=aes(x=DftYr), data=sd_usa) +
  geom_bar(mapping=aes(fill=factor(DftYr)), show.legend=FALSE, na.rm=TRUE) +
  labs(title = "Distribution of Draft Year in US") +
  theme(axis.text.x = element_text(angle=90))

ggplot(mapping=aes(y=Salary, x=DftYr), data=salary_data) +
  geom_boxplot(mapping=aes(fill=factor(DftYr)), show.legend=FALSE, na.rm=TRUE) + 
  labs(title = "Boxplots of Salary by Draft Year") + 
  theme(axis.text.x = element_text(angle=90))

Random Forest Regression

In this section, I want to use random forest regression to identify the most important determinants of salary. I will use feature selection and hyperparameter tuning to increase the accuracy of the model.

setwd("C:\\Users\\eaalc\\OneDrive - Umich\\hockey_salary\\archive (11)")
y_train_data <- read.csv("test.csv")
y_test_data <- read.csv("test_salaries.csv")

#y_train_data is original csv, with all predictor columns
#y_train_test_data will include the salary column in y_test_data
y_train_test_data <- y_train_data
y_train_test_data$Salary <- y_test_data$Salary

#drop na values in all of y, so that salary is dropped with null predictor values
y_train_test_data <- y_train_test_data %>% drop_na()

#only using salary from y_train_test_data. drop na values from y_train_data
y_train_data <- y_train_data %>% drop_na()

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
suppressWarnings({library(randomForestExplainer)})
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

All Columns of Original

Achieved mean absolute error (MAE) of $1,088,142 and root mean square error (RMSE) of $1,609,394. Identified as most important determining factors are date of birth, draft year, time on ice divided by games played (TOI.GP), and team shots on goal while the player was on ice (SF).

The output of this model shows a variety of different visualizations and text. The first block shows a summary of the model, prints the MAE and RMSE, and prints the importance of model features in descending order. The “Predicted vs. Actual” scatterplot shows how predicted values compare to actual values. The closer these points are to the line y=x, the better the predictions fared. “rf_results” shows how error in the model changed with number of trees used. This is useful in understanding over- and under-fitting of the random forest to the training dataset. Finally, and most importantly, the final graph prints the Top 10 most important model features as measured by IncNodePurity and %IncMSE.

set.seed(123)
salary_data <- salary_data %>% drop_na()
rf_results <- randomForest(Salary ~ ., data=salary_data, localImp=TRUE)
summary(rf_results)
##                 Length Class  Mode     
## call                4  -none- call     
## type                1  -none- character
## predicted         494  -none- numeric  
## mse               500  -none- numeric  
## rsq               500  -none- numeric  
## oob.times         494  -none- numeric  
## importance        306  -none- numeric  
## importanceSD      153  -none- numeric  
## localImportance 75582  -none- numeric  
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             11  -none- list     
## coefs               0  -none- NULL     
## y                 494  -none- numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call
#generate predictions with y_train_data
rf_predictions <- predict(rf_results, y_train_data)

#getting salary from y_train_test_data and storing in variable name real_ytest
real_ytest <- y_train_test_data$Salary

ggplot(mapping=aes(y=real_ytest, x=rf_predictions)) +
  geom_point(color="darkred") +
  labs(title = "Predicted Salary vs. Actual",
       x = "Predicted Salary",
       y = "Actual Salary") +
  geom_abline(intercept=0, slope=1, color="red")

suppressWarnings({library(Metrics)})
print(paste("MEAN ABSOLUTE ERROR:", mae(rf_predictions, real_ytest)))
## [1] "MEAN ABSOLUTE ERROR: 1088142.03037281"
print(paste("ROOT MEAN SQUARED ERROR:", rmse(rf_predictions, real_ytest)))
## [1] "ROOT MEAN SQUARED ERROR: 1609394.15926762"
plot(rf_results)

varImpPlot(rf_results, sort=TRUE, n.var = 10,
                          main = "Top 10 Variables used in RF Model", pch=19,
           color="darkred")

print(sort(importance(rf_results)[,1], decreasing=TRUE))
##        Born       DftYr      TOI.GP        TOI.    TOI.GP.1          SF 
## 24.48222577 22.92848345 11.73406395 10.80076317 10.72727988  9.86906766 
##         xGF        Ovrl          GF          FF         SCF         FOL 
##  8.51776919  7.83402677  7.32961556  6.92807044  6.79411475  6.73076629 
##          CF        OTOI          GP         FOW        GS.G      Shifts 
##  6.45466561  6.36383943  6.08440565  5.90583900  5.01699464  4.91578453 
##          HF          PS       iSF.2         iFF       sDist       iSF.1 
##  4.85718340  4.74014576  4.72451063  4.60501501  4.48470960  4.46351801 
##        iBLK      S.Wrap         OTG       DftRd     sDist.1          GS 
##  4.43222014  4.28252046  4.25308398  4.25149411  4.19343920  4.17737833 
##         ixG         iSF       iCF.1      iBLK.1          GA         GWG 
##  4.13608080  4.10507228  4.06631218  4.02116744  3.99973279  3.93946130 
##         SCA      S.Wrst          A2         TOI      S.Snap         iCF 
##  3.86058335  3.76362410  3.76254270  3.69436912  3.67939623  3.67020059 
##        CBar        Wide        Grit       iMiss         iRS        iSCF 
##  3.66607979  3.64237524  3.54208130  3.52131453  3.43988878  3.38394176 
##         PTS      iTKA.1       ozFOW       S.Tip         RSF          Wt 
##  3.33109010  3.32519030  3.23117426  3.21365817  3.20093490  3.14135932 
##         iDS         OPS      S.Slap         RBF      iGVA.1           A 
##  3.13877335  3.13628792  3.10939014  3.08130268  3.06144052  3.03103471 
##         GVA         DSF          FA         xGA        PENT         TKA 
##  3.02156781  2.99173075  2.99092786  2.97854212  2.91429328  2.90935273 
##         DPS        IPP.        Pass         RBA          CA         Min 
##  2.86103426  2.84384853  2.79503073  2.78882920  2.77850239  2.76795625 
##       iHF.1       iPenT        F.60        TOIX         SH.        E... 
##  2.71190341  2.68984836  2.60064294  2.56786992  2.51455251  2.49144599 
##         RSA        iTKA        BLK.          SA        Diff      S.Bkhd 
##  2.44407266  2.42134460  2.39243282  2.23916173  2.21108858  2.20867573 
##         iHA         DSA         DAP         iHF         ENG          HA 
##  2.14498353  2.13420356  2.07068288  2.04714856  2.04630795  2.04021212 
##        iGVA      G.Snap        Over        Misc         PIM       iPenD 
##  2.03025917  2.02108296  1.97112616  1.95798089  1.92718457  1.78309563 
##        Pct.       Cntry        Pace         PDO    FOL.Down      iFOW.1 
##  1.74884573  1.73607069  1.69019169  1.54957404  1.54792494  1.52947761 
##       G.Tip       nzFOW        iFOL       dzFOW        Team        Game 
##  1.52420729  1.50334434  1.48112138  1.37910173  1.35451700  1.24394345 
##           G    FOW.Down       Pr.St       ozFOL       nzFOL          A1 
##  1.19016434  1.17640653  1.13783830  1.00629117  0.98189746  0.95266848 
##         iRB        PEND       iPENT     S.Dflct     Diff.60          Ht 
##  0.92155709  0.88172925  0.87681677  0.78930881  0.77622052  0.74119449 
##    Position        iHDf      iPenDf        A.60        Post         PSA 
##  0.59831217  0.57297713  0.44108611  0.40748425  0.26687492  0.26609235 
##      G.Wrst       Match         FO.         Nat      iFOL.1         NPD 
##  0.04820722  0.00000000 -0.05155809 -0.06414688 -0.06892691 -0.16877666 
##        iFOW   FOL.Close         X1G         SV.     G.Dflct         Maj 
## -0.20062277 -0.29874328 -0.38931997 -0.44871140 -0.46487547 -0.49517645 
##       iPEND       X.FOT       dzFOL        Hand        X...      FOL.Up 
## -0.61487557 -0.75830981 -0.82737891 -0.87631340 -0.90911246 -1.00538887 
##      G.Bkhd         PSG      FOW.Up      G.Slap      G.Wrap   FOW.Close 
## -1.12974574 -1.33542546 -1.35273615 -1.47222833 -1.62454432 -1.92858999 
##   Last.Name  First.Name        City 
## -2.15034579 -2.21685057 -2.37076719
feature_importances <- c(sort(importance(rf_results)[,1], decreasing=TRUE))

Cut Weakest Columns

While I still see columns that may be creating noise in the model (e.g.: first name), I will cut the negative importance columns.

After cutting the noisy columns, this model achieved a MAE of $1,076,131 and RMSE of $1,603,168. The overall set of important features remained very similar, however, the IncNodePurity chart increased the importance level of Draft Year and team goals scored while player on ice (GF).

print_rf_results <- function(model, ytrain_data, ytest_data) {
  print(summary(model))
  rf_predicted <- predict(model, ytrain_data)
  ggplot(mapping=aes(y=ytest_data, x=rf_predicted)) +
    geom_point(color="darkred") +
    labs(title = "Predicted Salary vs. Actual",
         x = "Predicted Salary",
         y = "Actual Salary") +
    geom_abline(intercept=0, slope=1, color="red")

print(paste("MEAN ABSOLUTE ERROR:", mae(rf_predicted, ytest_data)))
print(paste("ROOT MEAN SQUARED ERROR:", rmse(rf_predicted, ytest_data)))
plot(model)
varImpPlot(model, sort=TRUE, n.var = 10,
                          main = "Top 10 Variables used in RF Model", pch=19,
           color="darkred")

print(sort(importance(model)[,1], decreasing=TRUE))
feature_importances <- c(sort(importance(model)[,1], decreasing=TRUE))

return(feature_importances)
  
}
#getting rid of features with negative importance and storing in variable 
#"important_features"
important_features <- feature_importances[feature_importances > 0]

#subsetting salary_data to get all important features
X_train <- salary_data %>% select(all_of(names(important_features)))
#including salary column in the one dataset
X_train$Salary <- salary_data$Salary

#X_train has ALL predictor values as well as the Salary column. It functions as 
#X_train and X_test. 

#subsetting y_train_data to get all important features
y_train <- y_train_data %>% select(all_of(names(important_features)))
#storing salary column of y_train_test into y_test -- since important_features only drops
#columns, y_train_test_data should be fine as is
y_test <- y_train_test_data$Salary
rf_resultsNW <- randomForest(Salary ~ ., data=X_train, localImp=TRUE)
feature_importancesNW <- print_rf_results(rf_resultsNW, y_train, y_test)
##                 Length Class  Mode     
## call                4  -none- call     
## type                1  -none- character
## predicted         494  -none- numeric  
## mse               500  -none- numeric  
## rsq               500  -none- numeric  
## oob.times         494  -none- numeric  
## importance        254  -none- numeric  
## importanceSD      127  -none- numeric  
## localImportance 62738  -none- numeric  
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             11  -none- list     
## coefs               0  -none- NULL     
## y                 494  -none- numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call     
## [1] "MEAN ABSOLUTE ERROR: 1076131.70506213"
## [1] "ROOT MEAN SQUARED ERROR: 1603168.40741544"

##        Born       DftYr      TOI.GP    TOI.GP.1        TOI.          SF 
## 22.51636736 21.72454682 11.96781846 11.18372177 10.16464320  8.36165988 
##         xGF          GF         SCF          FF        Ovrl          GP 
##  7.61187451  7.60899141  7.35698957  6.99510385  6.63465414  6.61548641 
##         FOL          CF        OTOI         FOW       iSF.1         RSA 
##  6.51583676  6.47466745  6.37114615  6.09274955  5.23686643  4.63424449 
##          HF        F.60         SCA      Shifts         iSF          PS 
##  4.34888463  4.34517272  4.23697879  4.22948025  4.12548343  4.05097382 
##        Over      S.Snap        iSCF         GVA       iSF.2       ozFOW 
##  4.02191939  4.02191531  4.01502704  3.97610775  3.90008350  3.83677857 
##         iCF      iTKA.1         DSF       iPenT       sDist         TKA 
##  3.81522500  3.71710578  3.71599836  3.70959921  3.65075738  3.58744321 
##          GA        TOIX       iMiss         iFF      S.Slap          A2 
##  3.54249433  3.48309197  3.48132266  3.46495925  3.45016427  3.36769195 
##      S.Wrst          GS        iBLK         RBA         RBF          SA 
##  3.34368707  3.28533733  3.25152580  3.21542018  3.17701529  3.13802465 
##        GS.G        iTKA         RSF         PIM       DftRd        Wide 
##  3.12286370  3.12125206  3.05074890  3.03458540  3.01904691  2.96558252 
##         iRS         PTS        CBar        iGVA          CA         OTG 
##  2.95901899  2.87842256  2.84459651  2.83365242  2.82972265  2.82388651 
##      iGVA.1       iCF.1       S.Tip     sDist.1        Misc       iPenD 
##  2.77337932  2.76161370  2.67549755  2.67208328  2.61461825  2.60739419 
##        Grit         TOI         iHA         GWG         iDS      G.Snap 
##  2.59633077  2.58869594  2.57847040  2.57514383  2.57194685  2.57039365 
##       iPENT         iRB         xGA      iBLK.1         ixG         DSA 
##  2.47146217  2.43695482  2.42208076  2.37729426  2.36651813  2.34749771 
##      S.Wrap           A        PEND         SH.         OPS        BLK. 
##  2.32528795  2.31926603  2.28157041  2.27515794  2.27437233  2.23883888 
##        IPP.       iHF.1        E...         DPS         Min          FA 
##  2.14613112  2.09510854  2.02797460  1.93275855  1.92437411  1.84374169 
##        Pace        Pass      G.Wrst        PENT        Pct.    Position 
##  1.82712928  1.80901946  1.74303679  1.72825345  1.51437362  1.45187133 
##    FOW.Down          HA        A.60      S.Bkhd          Ht        Post 
##  1.35029181  1.34804615  1.28565954  1.22690339  1.22134462  1.20639003 
##           G         iHF         DAP          Wt          A1    FOL.Down 
##  1.19834494  1.14859269  1.12356617  0.97089455  0.88310119  0.77119281 
##       nzFOW      iFOW.1       Pr.St      iPenDf     S.Dflct        Diff 
##  0.66716668  0.45709188  0.34739970  0.33439499  0.32724110  0.02398969 
##       ozFOL       G.Tip       dzFOW       Cntry        Team       nzFOL 
## -0.04569959 -0.08244146 -0.09561549 -0.32614220 -0.42658866 -0.52081539 
##         PSA         PDO     Diff.60        iFOL         ENG        Game 
## -0.66782499 -0.73859904 -0.79089063 -0.88207712 -1.39041383 -1.49393369 
##        iHDf 
## -1.91849850

Hyperparameter Tuning & Cross-Validation: Caret & Mtry

With hyperparameter tuning, this model saw slight improvements in MAE and RMSE. The overall set of important features remained close to the same.

library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:Metrics':
## 
##     precision, recall
set.seed(123)
control <- trainControl(method="repeatedcv", number = 5, repeats = 3)
print(ncol(X_train))
## [1] 128
grid <- expand.grid(.mtry = seq(35,125, by = 15)) #default is (#predictors)/3, which would be 127/3, which is just over 42

trained_rf <- train(
Salary~., 
data=X_train, 
trControl=control,
tuneGrid = grid,
method = "rf",
tuneLength = 5)

print(trained_rf)
## Random Forest 
## 
## 494 samples
## 127 predictors
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 397, 395, 394, 395, 395, 394, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE     Rsquared   MAE      
##    35   1591985  0.5508171  1082468.8
##    50   1563445  0.5677225  1060544.8
##    65   1545657  0.5781056  1046233.6
##    80   1528749  0.5877083  1032074.3
##    95   1513726  0.5962288  1023065.1
##   110   1500858  0.6034384  1014024.0
##   125   1482960  0.6142353   997661.3
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 125.
print(trained_rf$finalModel)
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 125
## 
##           Mean of squared residuals: 2.117251e+12
##                     % Var explained: 61.86
finalModel <- randomForest(Salary ~., data = X_train, localImp = TRUE, mtry = 85, ntree = 500)

print_rf_results(finalModel, y_train, y_test)
##                 Length Class  Mode     
## call                6  -none- call     
## type                1  -none- character
## predicted         494  -none- numeric  
## mse               500  -none- numeric  
## rsq               500  -none- numeric  
## oob.times         494  -none- numeric  
## importance        254  -none- numeric  
## importanceSD      127  -none- numeric  
## localImportance 62738  -none- numeric  
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             11  -none- list     
## coefs               0  -none- NULL     
## y                 494  -none- numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call     
## [1] "MEAN ABSOLUTE ERROR: 1083522.6124269"
## [1] "ROOT MEAN SQUARED ERROR: 1595992.86533047"

##         Born        DftYr         TOI.       TOI.GP     TOI.GP.1           GF 
## 26.764393367 24.008220239 12.804791398 12.414052458 11.421630101 10.051454961 
##         Ovrl           SF          xGF           GP         OTOI           FF 
##  9.351391868  9.008954733  7.133221487  6.483796266  6.355644812  6.297115912 
##          FOW          FOL          OTG           GS          iSF          SCF 
##  5.788700891  5.769495779  5.415682952  5.146832115  5.130163367  5.032195669 
##         CBar        sDist       S.Snap       S.Slap       S.Wrst         iSCF 
##  4.748727990  4.509959753  4.461433069  4.313212598  4.277235774  4.248517713 
##       Shifts           CF           GA          iCF        iSF.1          GVA 
##  4.215448667  4.180624326  4.139789561  4.063415474  3.992185943  3.974876454 
##          RSA          DSA        DftRd           HF          SCA        iMiss 
##  3.955751691  3.834853297  3.813739363  3.810778951  3.775163741  3.690408899 
##           A2         Wide         GS.G           PS          iRB          ixG 
##  3.676290786  3.645108607  3.633305438  3.610214150  3.595268660  3.480885471 
##        S.Tip          TOI          GWG         Pace          RSF       iBLK.1 
##  3.346414030  3.329585036  3.221806341  3.217713209  3.174221855  3.025100847 
##          TKA          SH.          iDS          xGA          PTS          RBF 
##  3.018045764  2.974892868  2.971259194  2.835857332  2.741014454  2.735896100 
##        ozFOW         Diff        iHF.1         Over     FOW.Down          DSF 
##  2.722534258  2.685076261  2.666567624  2.664120804  2.579676300  2.561030820 
##        iCF.1         PENT         iBLK         PEND          iHA          DPS 
##  2.557349361  2.528110773  2.526564711  2.508840054  2.449815740  2.435116068 
##        iPENT        iPenD         F.60           HA         Pct.            A 
##  2.415971545  2.346187236  2.274454030  2.238342994  2.234317528  2.169024021 
##         iTKA           SA         Pass           FA         BLK.      sDist.1 
##  2.125826173  2.120084133  2.072541943  1.997675758  1.957988996  1.926816121 
##       iGVA.1         IPP.       G.Snap           CA         Grit          iFF 
##  1.880160904  1.784961551  1.709598294  1.707276479  1.684499804  1.649876053 
##          PIM          Min          OPS         A.60          PSA        dzFOW 
##  1.623735388  1.607664868  1.570998803  1.485317615  1.454454279  1.438152140 
##         iGVA          iRS         E...     FOL.Down          RBA        iPenT 
##  1.394277913  1.377221218  1.373461469  1.350924783  1.329429951  1.295351023 
##        iSF.2         Post         TOIX      Diff.60        G.Tip       iTKA.1 
##  1.275978881  1.267433739  1.267174176  1.250864070  1.071081145  0.884188406 
##         Misc      S.Dflct          ENG       G.Wrst           A1           Wt 
##  0.829602185  0.777813781  0.776432704  0.621970052  0.611672853  0.577726108 
##            G       S.Wrap        ozFOL        nzFOL       S.Bkhd           Ht 
##  0.502518936  0.282796356 -0.005267235 -0.118169065 -0.129360937 -0.143281981 
##          DAP          PDO       iPenDf        Cntry        Pr.St         Game 
## -0.200812551 -0.234759389 -0.320382475 -0.470494862 -0.682933720 -0.696145437 
##          iHF         iFOL        nzFOW       iFOW.1         Team         iHDf 
## -0.742009387 -0.823680443 -1.075200282 -1.240703523 -1.270223270 -1.438626597 
##     Position 
## -1.503646733
##         Born        DftYr         TOI.       TOI.GP     TOI.GP.1           GF 
## 26.764393367 24.008220239 12.804791398 12.414052458 11.421630101 10.051454961 
##         Ovrl           SF          xGF           GP         OTOI           FF 
##  9.351391868  9.008954733  7.133221487  6.483796266  6.355644812  6.297115912 
##          FOW          FOL          OTG           GS          iSF          SCF 
##  5.788700891  5.769495779  5.415682952  5.146832115  5.130163367  5.032195669 
##         CBar        sDist       S.Snap       S.Slap       S.Wrst         iSCF 
##  4.748727990  4.509959753  4.461433069  4.313212598  4.277235774  4.248517713 
##       Shifts           CF           GA          iCF        iSF.1          GVA 
##  4.215448667  4.180624326  4.139789561  4.063415474  3.992185943  3.974876454 
##          RSA          DSA        DftRd           HF          SCA        iMiss 
##  3.955751691  3.834853297  3.813739363  3.810778951  3.775163741  3.690408899 
##           A2         Wide         GS.G           PS          iRB          ixG 
##  3.676290786  3.645108607  3.633305438  3.610214150  3.595268660  3.480885471 
##        S.Tip          TOI          GWG         Pace          RSF       iBLK.1 
##  3.346414030  3.329585036  3.221806341  3.217713209  3.174221855  3.025100847 
##          TKA          SH.          iDS          xGA          PTS          RBF 
##  3.018045764  2.974892868  2.971259194  2.835857332  2.741014454  2.735896100 
##        ozFOW         Diff        iHF.1         Over     FOW.Down          DSF 
##  2.722534258  2.685076261  2.666567624  2.664120804  2.579676300  2.561030820 
##        iCF.1         PENT         iBLK         PEND          iHA          DPS 
##  2.557349361  2.528110773  2.526564711  2.508840054  2.449815740  2.435116068 
##        iPENT        iPenD         F.60           HA         Pct.            A 
##  2.415971545  2.346187236  2.274454030  2.238342994  2.234317528  2.169024021 
##         iTKA           SA         Pass           FA         BLK.      sDist.1 
##  2.125826173  2.120084133  2.072541943  1.997675758  1.957988996  1.926816121 
##       iGVA.1         IPP.       G.Snap           CA         Grit          iFF 
##  1.880160904  1.784961551  1.709598294  1.707276479  1.684499804  1.649876053 
##          PIM          Min          OPS         A.60          PSA        dzFOW 
##  1.623735388  1.607664868  1.570998803  1.485317615  1.454454279  1.438152140 
##         iGVA          iRS         E...     FOL.Down          RBA        iPenT 
##  1.394277913  1.377221218  1.373461469  1.350924783  1.329429951  1.295351023 
##        iSF.2         Post         TOIX      Diff.60        G.Tip       iTKA.1 
##  1.275978881  1.267433739  1.267174176  1.250864070  1.071081145  0.884188406 
##         Misc      S.Dflct          ENG       G.Wrst           A1           Wt 
##  0.829602185  0.777813781  0.776432704  0.621970052  0.611672853  0.577726108 
##            G       S.Wrap        ozFOL        nzFOL       S.Bkhd           Ht 
##  0.502518936  0.282796356 -0.005267235 -0.118169065 -0.129360937 -0.143281981 
##          DAP          PDO       iPenDf        Cntry        Pr.St         Game 
## -0.200812551 -0.234759389 -0.320382475 -0.470494862 -0.682933720 -0.696145437 
##          iHF         iFOL        nzFOW       iFOW.1         Team         iHDf 
## -0.742009387 -0.823680443 -1.075200282 -1.240703523 -1.270223270 -1.438626597 
##     Position 
## -1.503646733